
Yulu is India's leading micro-mobility service provider, offering unique vehicles for the daily commute. Founded with a mission to eliminate traffic congestion in India, Yulu provides the safest commute solution through a user-friendly mobile app that enables shared, solo and sustainable commuting.
Yulu zones are placed at convenient locations (including metro stations, bus stands, office spaces, residential areas and corporate offices) to make first- and last-mile travel smooth, affordable and convenient.
Yulu has recently suffered considerable dips in its revenues. It has contracted a consulting company to understand the factors on which the demand for its shared electric cycles depends in the Indian market.
#importing libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from scipy import stats
from scipy.stats import norm, ttest_ind , levene, shapiro, f_oneway, kruskal, chi2_contingency, chisquare, ttest_1samp, probplot
from statsmodels.graphics.gofplots import qqplot
import warnings
warnings.filterwarnings('ignore')
import copy
df = pd.read_csv('bike_sharing.csv')
df.head()
| | datetime | season | holiday | workingday | weather | temp | atemp | humidity | windspeed | casual | registered | count |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2011-01-01 00:00:00 | 1 | 0 | 0 | 1 | 9.84 | 14.395 | 81 | 0.0 | 3 | 13 | 16 |
| 1 | 2011-01-01 01:00:00 | 1 | 0 | 0 | 1 | 9.02 | 13.635 | 80 | 0.0 | 8 | 32 | 40 |
| 2 | 2011-01-01 02:00:00 | 1 | 0 | 0 | 1 | 9.02 | 13.635 | 80 | 0.0 | 5 | 27 | 32 |
| 3 | 2011-01-01 03:00:00 | 1 | 0 | 0 | 1 | 9.84 | 14.395 | 75 | 0.0 | 3 | 10 | 13 |
| 4 | 2011-01-01 04:00:00 | 1 | 0 | 0 | 1 | 9.84 | 14.395 | 75 | 0.0 | 0 | 1 | 1 |
The dataset has the following features:
weather:
1: Clear, Few clouds, Partly cloudy
2: Mist + Cloudy, Mist + Broken clouds, Mist + Few clouds, Mist
3: Light Snow, Light Rain + Thunderstorm + Scattered clouds, Light Rain + Scattered clouds
4: Heavy Rain + Ice Pellets + Thunderstorm + Mist, Snow + Fog
temp: temperature in Celsius
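The weather codes listed above can be turned into readable labels in the same way the notebook later relabels `season`. The following is a minimal sketch on a small made-up frame; the short label names in `WEATHER_LABELS` are hypothetical abbreviations of the descriptions above, not part of the dataset.

```python
import pandas as pd

# Hypothetical short labels for the four weather codes described above.
WEATHER_LABELS = {
    1: "Clear",
    2: "Mist",
    3: "Light precipitation",
    4: "Heavy precipitation",
}

# Small made-up frame standing in for the real data.
demo = pd.DataFrame({"weather": [1, 2, 3, 4, 1]})

# map() returns NaN for any code missing from the dictionary,
# which makes unexpected codes easy to spot.
demo["weather_label"] = demo["weather"].map(WEATHER_LABELS)
print(demo["weather_label"].value_counts())
```

Using `map` rather than `replace` is a design choice here: unmapped codes surface as missing values instead of silently staying numeric.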
df.shape
(10886, 12)
df.ndim
2
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10886 entries, 0 to 10885
Data columns (total 12 columns):
 #   Column      Non-Null Count  Dtype
---  ------      --------------  -----
 0   datetime    10886 non-null  object
 1   season      10886 non-null  int64
 2   holiday     10886 non-null  int64
 3   workingday  10886 non-null  int64
 4   weather     10886 non-null  int64
 5   temp        10886 non-null  float64
 6   atemp       10886 non-null  float64
 7   humidity    10886 non-null  int64
 8   windspeed   10886 non-null  float64
 9   casual      10886 non-null  int64
 10  registered  10886 non-null  int64
 11  count       10886 non-null  int64
dtypes: float64(3), int64(8), object(1)
memory usage: 1020.7+ KB
null_values = df.isnull().sum()
print("Total No.Of Null Values:")
print(null_values)
Total No.Of Null Values:
datetime      0
season        0
holiday       0
workingday    0
weather       0
temp          0
atemp         0
humidity      0
windspeed     0
casual        0
registered    0
count         0
dtype: int64
duplicated_values = df.duplicated().sum()
print("Total No.Of Duplicated Values : ", duplicated_values)
Total No.Of Duplicated Values : 0
df.describe().T
| | count | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|
| season | 10886.0 | 2.506614 | 1.116174 | 1.00 | 2.0000 | 3.000 | 4.0000 | 4.0000 |
| holiday | 10886.0 | 0.028569 | 0.166599 | 0.00 | 0.0000 | 0.000 | 0.0000 | 1.0000 |
| workingday | 10886.0 | 0.680875 | 0.466159 | 0.00 | 0.0000 | 1.000 | 1.0000 | 1.0000 |
| weather | 10886.0 | 1.418427 | 0.633839 | 1.00 | 1.0000 | 1.000 | 2.0000 | 4.0000 |
| temp | 10886.0 | 20.230860 | 7.791590 | 0.82 | 13.9400 | 20.500 | 26.2400 | 41.0000 |
| atemp | 10886.0 | 23.655084 | 8.474601 | 0.76 | 16.6650 | 24.240 | 31.0600 | 45.4550 |
| humidity | 10886.0 | 61.886460 | 19.245033 | 0.00 | 47.0000 | 62.000 | 77.0000 | 100.0000 |
| windspeed | 10886.0 | 12.799395 | 8.164537 | 0.00 | 7.0015 | 12.998 | 16.9979 | 56.9969 |
| casual | 10886.0 | 36.021955 | 49.960477 | 0.00 | 4.0000 | 17.000 | 49.0000 | 367.0000 |
| registered | 10886.0 | 155.552177 | 151.039033 | 0.00 | 36.0000 | 118.000 | 222.0000 | 886.0000 |
| count | 10886.0 | 191.574132 | 181.144454 | 1.00 | 42.0000 | 145.000 | 284.0000 | 977.0000 |
df["datetime"] = pd.to_datetime(df['datetime'])
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10886 entries, 0 to 10885
Data columns (total 12 columns):
 #   Column      Non-Null Count  Dtype
---  ------      --------------  -----
 0   datetime    10886 non-null  datetime64[ns]
 1   season      10886 non-null  int64
 2   holiday     10886 non-null  int64
 3   workingday  10886 non-null  int64
 4   weather     10886 non-null  int64
 5   temp        10886 non-null  float64
 6   atemp       10886 non-null  float64
 7   humidity    10886 non-null  int64
 8   windspeed   10886 non-null  float64
 9   casual      10886 non-null  int64
 10  registered  10886 non-null  int64
 11  count       10886 non-null  int64
dtypes: datetime64[ns](1), float64(3), int64(8)
memory usage: 1020.7 KB
cat_column = ["season","holiday","workingday","weather"]
for i in cat_column:
    df[i] = df[i].astype("object")
print(f"Minimum Time Entry : { df['datetime'].min() }")
print(f"Maximum Time Entry : { df['datetime'].max() }")
Minimum Time Entry : 2011-01-01 00:00:00
Maximum Time Entry : 2012-12-19 23:00:00
print(f"Date Range : {df['datetime'].max() - df['datetime'].min()} ")
Date Range : 718 days 23:00:00
df["season"] = df["season"].replace({1: "Spring", 2: "Summer" , 3: "Fall" , 4 : "Winter"})
df["season"].unique()
array(['Spring', 'Summer', 'Fall', 'Winter'], dtype=object)
df["holiday"] = df["holiday"].replace({1 : "Yes" , 0 :"No"})
df["holiday"].unique()
array(['No', 'Yes'], dtype=object)
df["workingday"] = df["workingday"].replace({1 : "Yes" , 0 :"No"})
df["workingday"].unique()
array(['No', 'Yes'], dtype=object)
for i in df.columns:
    print(f"Unique Value of {i}:")
    print(df[i].unique())
    print("-"*50)
    print(f"Total No of {i}: ", df[i].nunique())
    print("-"*50)
    print(f"Value Count of {i}:")
    print(df[i].value_counts().sort_values(ascending= False).head(5))
    print("-"*50)
    print()
Unique Value of datetime: <DatetimeArray> ['2011-01-01 00:00:00', '2011-01-01 01:00:00', '2011-01-01 02:00:00', '2011-01-01 03:00:00', '2011-01-01 04:00:00', '2011-01-01 05:00:00', '2011-01-01 06:00:00', '2011-01-01 07:00:00', '2011-01-01 08:00:00', '2011-01-01 09:00:00', ... '2012-12-19 14:00:00', '2012-12-19 15:00:00', '2012-12-19 16:00:00', '2012-12-19 17:00:00', '2012-12-19 18:00:00', '2012-12-19 19:00:00', '2012-12-19 20:00:00', '2012-12-19 21:00:00', '2012-12-19 22:00:00', '2012-12-19 23:00:00'] Length: 10886, dtype: datetime64[ns] -------------------------------------------------- Total No of datetime: 10886 -------------------------------------------------- Value Count of datetime: datetime 2011-01-01 00:00:00 1 2011-01-01 07:00:00 1 2011-01-01 21:00:00 1 2011-01-01 20:00:00 1 2011-01-01 19:00:00 1 Name: count, dtype: int64 -------------------------------------------------- Unique Value of season: ['Spring' 'Summer' 'Fall' 'Winter'] -------------------------------------------------- Total No of season: 4 -------------------------------------------------- Value Count of season: season Winter 2734 Summer 2733 Fall 2733 Spring 2686 Name: count, dtype: int64 -------------------------------------------------- Unique Value of holiday: ['No' 'Yes'] -------------------------------------------------- Total No of holiday: 2 -------------------------------------------------- Value Count of holiday: holiday No 10575 Yes 311 Name: count, dtype: int64 -------------------------------------------------- Unique Value of workingday: ['No' 'Yes'] -------------------------------------------------- Total No of workingday: 2 -------------------------------------------------- Value Count of workingday: workingday Yes 7412 No 3474 Name: count, dtype: int64 -------------------------------------------------- Unique Value of weather: [1 2 3 4] -------------------------------------------------- Total No of weather: 4 -------------------------------------------------- Value Count of 
weather: weather 1 7192 2 2834 3 859 4 1 Name: count, dtype: int64 -------------------------------------------------- Unique Value of temp: [ 9.84 9.02 8.2 13.12 15.58 14.76 17.22 18.86 18.04 16.4 13.94 12.3 10.66 6.56 5.74 7.38 4.92 11.48 4.1 3.28 2.46 21.32 22.96 23.78 24.6 19.68 22.14 20.5 27.06 26.24 25.42 27.88 28.7 30.34 31.16 29.52 33.62 35.26 36.9 32.8 31.98 34.44 36.08 37.72 38.54 1.64 0.82 39.36 41. ] -------------------------------------------------- Total No of temp: 49 -------------------------------------------------- Value Count of temp: temp 14.76 467 26.24 453 28.70 427 13.94 413 18.86 406 Name: count, dtype: int64 -------------------------------------------------- Unique Value of atemp: [14.395 13.635 12.88 17.425 19.695 16.665 21.21 22.725 21.97 20.455 11.365 10.605 9.85 8.335 6.82 5.305 6.06 9.09 12.12 7.575 15.91 3.03 3.79 4.545 15.15 18.18 25. 26.515 27.275 29.545 23.485 25.76 31.06 30.305 24.24 18.94 31.82 32.575 33.335 28.79 34.85 35.605 37.12 40.15 41.665 40.91 39.395 34.09 28.03 36.365 37.88 42.425 43.94 38.635 1.515 0.76 2.275 43.18 44.695 45.455] -------------------------------------------------- Total No of atemp: 60 -------------------------------------------------- Value Count of atemp: atemp 31.060 671 25.760 423 22.725 406 20.455 400 26.515 395 Name: count, dtype: int64 -------------------------------------------------- Unique Value of humidity: [ 81 80 75 86 76 77 72 82 88 87 94 100 71 66 57 46 42 39 44 47 50 43 40 35 30 32 64 69 55 59 63 68 74 51 56 52 49 48 37 33 28 38 36 93 29 53 34 54 41 45 92 62 58 61 60 65 70 27 25 26 31 73 21 24 23 22 19 15 67 10 8 12 14 13 17 16 18 20 85 0 83 84 78 79 89 97 90 96 91] -------------------------------------------------- Total No of humidity: 89 -------------------------------------------------- Value Count of humidity: humidity 88 368 94 324 83 316 87 289 70 259 Name: count, dtype: int64 -------------------------------------------------- Unique Value of windspeed: [ 0. 
6.0032 16.9979 19.0012 19.9995 12.998 15.0013 8.9981 11.0014 22.0028 30.0026 23.9994 27.9993 26.0027 7.0015 32.9975 36.9974 31.0009 35.0008 39.0007 43.9989 40.9973 51.9987 46.0022 50.0021 43.0006 56.9969 47.9988] -------------------------------------------------- Total No of windspeed: 28 -------------------------------------------------- Value Count of windspeed: windspeed 0.0000 1313 8.9981 1120 11.0014 1057 12.9980 1042 7.0015 1034 Name: count, dtype: int64 -------------------------------------------------- Unique Value of casual: [ 3 8 5 0 2 1 12 26 29 47 35 40 41 15 9 6 11 4 7 16 20 19 10 13 14 18 17 21 33 23 22 28 48 52 42 24 30 27 32 58 62 51 25 31 59 45 73 55 68 34 38 102 84 39 36 43 46 60 80 83 74 37 70 81 100 99 54 88 97 144 149 124 98 50 72 57 71 67 95 90 126 174 168 170 175 138 92 56 111 89 69 139 166 219 240 147 148 78 53 63 79 114 94 85 128 93 121 156 135 103 44 49 64 91 119 167 181 179 161 143 75 66 109 123 113 65 86 82 132 129 196 142 122 106 61 107 120 195 183 206 158 137 76 115 150 188 193 180 127 154 108 96 110 112 169 131 176 134 162 153 210 118 141 146 159 178 177 136 215 198 248 225 194 237 242 235 224 236 222 77 87 101 145 182 171 160 133 105 104 187 221 201 205 234 185 164 200 130 155 116 125 204 186 214 245 218 217 152 191 256 251 262 189 212 272 223 208 165 229 151 117 199 140 226 286 352 357 367 291 233 190 283 295 232 173 184 172 320 355 326 321 354 299 227 254 260 207 274 308 288 311 253 197 163 275 298 282 266 220 241 230 157 293 257 269 255 228 276 332 361 356 331 279 203 250 259 297 265 267 192 239 238 213 264 244 243 246 289 287 209 263 249 247 284 327 325 312 350 258 362 310 317 268 202 294 280 216 292 304] -------------------------------------------------- Total No of casual: 309 -------------------------------------------------- Value Count of casual: casual 0 986 1 667 2 487 3 438 4 354 Name: count, dtype: int64 -------------------------------------------------- Unique Value of registered: [ 13 32 27 10 1 0 2 7 6 24 30 55 47 71 
70 52 26 31 25 17 16 8 4 19 46 54 73 64 67 58 43 29 20 9 5 3 63 153 81 33 41 48 53 66 146 148 102 49 11 36 92 177 98 37 50 79 68 202 179 110 34 87 192 109 74 65 85 186 166 127 82 40 18 95 216 116 42 57 78 59 163 158 51 76 190 125 178 39 14 15 56 60 90 83 69 28 35 22 12 77 44 38 75 184 174 154 97 214 45 72 130 94 139 135 197 137 141 156 117 155 134 89 80 108 61 124 132 196 107 114 172 165 105 119 183 175 88 62 86 170 145 217 91 195 152 21 126 115 223 207 123 236 128 151 100 198 157 168 84 99 173 121 159 93 23 212 111 193 103 113 122 106 96 249 218 194 213 191 142 224 244 143 267 256 211 161 131 246 118 164 275 204 230 243 112 238 144 185 101 222 138 206 104 200 129 247 140 209 136 176 120 229 210 133 259 147 227 150 282 162 265 260 189 237 245 205 308 283 248 303 291 280 208 286 352 290 262 203 284 293 160 182 316 338 279 187 277 362 321 331 372 377 350 220 472 450 268 435 169 225 464 485 323 388 367 266 255 415 233 467 456 305 171 470 385 253 215 240 235 263 221 351 539 458 339 301 397 271 532 480 365 241 421 242 234 341 394 540 463 361 429 359 180 188 261 254 366 181 398 272 167 149 325 521 426 298 428 487 431 288 239 453 454 345 417 434 278 285 442 484 451 252 471 488 270 258 264 281 410 516 500 343 311 432 475 479 355 329 199 400 414 423 232 219 302 529 510 348 346 441 473 335 445 555 527 273 364 299 269 257 342 324 226 391 466 297 517 486 489 492 228 289 455 382 380 295 251 418 412 340 433 231 333 514 483 276 478 287 381 334 347 320 493 491 369 201 408 378 443 460 465 313 513 292 497 376 326 413 328 525 296 452 506 393 368 337 567 462 349 319 300 515 373 399 507 396 512 503 386 427 312 384 530 310 536 437 505 371 375 534 469 474 553 402 274 523 448 409 387 438 407 250 459 425 422 379 392 430 401 306 370 449 363 389 374 436 356 317 446 294 508 315 522 494 327 495 404 447 504 318 579 551 498 533 332 554 509 573 545 395 440 547 557 623 571 614 638 628 642 647 602 634 648 353 322 357 314 563 615 681 601 543 577 354 661 653 304 645 646 419 610 677 618 595 565 586 
670 656 626 581 546 604 596 383 621 564 309 360 330 549 589 461 631 673 358 651 663 538 616 662 344 640 659 770 608 617 584 307 667 605 641 594 629 603 518 665 769 749 499 719 734 696 688 570 675 405 411 643 733 390 680 764 679 531 637 652 778 703 537 576 613 715 726 598 625 444 672 782 548 682 750 716 609 698 572 669 633 725 704 658 620 542 575 511 741 790 644 740 735 560 739 439 660 697 336 619 712 624 580 678 684 468 649 786 718 775 636 578 746 743 481 664 711 689 751 745 424 699 552 709 591 757 768 767 723 558 561 403 502 692 780 622 761 690 744 857 562 702 802 727 811 886 406 787 496 708 758 812 807 791 639 781 833 756 544 789 742 655 416 806 773 737 706 566 713 800 839 779 766 794 803 788 720 668 490 568 597 477 583 501 556 593 420 541 694 650 559 666 700 693 582] -------------------------------------------------- Total No of registered: 731 -------------------------------------------------- Value Count of registered: registered 3 195 4 190 5 177 6 155 2 150 Name: count, dtype: int64 -------------------------------------------------- Unique Value of count: [ 16 40 32 13 1 2 3 8 14 36 56 84 94 106 110 93 67 35 37 34 28 39 17 9 6 20 53 70 75 59 74 76 65 30 22 31 5 64 154 88 44 51 61 77 72 157 52 12 4 179 100 42 57 78 97 63 83 212 182 112 54 48 11 33 195 115 46 79 71 62 89 190 169 132 43 19 95 219 122 45 86 172 163 69 23 7 210 134 73 50 87 187 123 15 25 98 102 55 10 49 82 92 41 38 188 47 178 155 24 18 27 99 217 130 136 29 128 81 68 139 137 202 60 162 144 158 117 90 159 101 118 129 26 104 91 113 105 21 80 125 133 197 109 161 135 116 176 168 108 103 175 147 96 220 127 205 174 121 230 66 114 216 243 152 199 58 166 170 165 160 140 211 120 145 256 126 223 85 206 124 255 222 285 146 274 272 185 191 232 327 224 107 119 196 171 214 242 148 268 201 150 111 167 228 198 204 164 233 257 151 248 235 141 249 194 259 156 153 244 213 181 221 250 304 241 271 282 225 253 237 299 142 313 310 207 138 280 173 332 331 149 267 301 312 278 281 184 215 367 349 292 303 339 143 189 366 
386 273 325 356 314 343 333 226 203 177 263 297 288 236 240 131 452 383 284 291 309 321 193 337 388 300 200 180 209 354 361 306 277 428 362 286 351 192 411 421 276 264 238 266 371 269 537 518 218 265 459 186 517 544 365 290 410 396 296 440 533 520 258 450 246 260 344 553 470 298 347 373 436 378 342 289 340 382 390 358 385 239 374 598 524 384 425 611 550 434 318 442 401 234 594 527 364 387 491 398 270 279 294 295 322 456 437 392 231 394 453 308 604 480 283 565 489 487 183 302 547 513 454 486 467 572 525 379 502 558 564 391 293 247 317 369 420 451 404 341 251 335 417 363 357 438 579 556 407 336 334 477 539 551 424 346 353 481 506 432 409 466 326 254 463 380 275 311 315 360 350 252 328 476 227 601 586 423 330 569 538 370 498 638 607 416 261 355 552 208 468 449 381 377 397 492 427 461 422 305 375 376 414 447 408 418 457 545 496 368 245 596 563 443 562 229 316 402 287 372 514 472 511 488 419 595 578 400 348 587 497 433 475 406 430 324 262 323 412 530 543 413 435 555 523 441 529 532 585 399 584 559 307 582 571 426 516 465 329 483 600 570 628 531 455 389 505 359 431 460 590 429 599 338 566 482 568 540 495 345 591 593 446 485 393 500 473 352 320 479 444 462 405 620 499 625 395 528 319 519 445 512 471 508 526 509 484 448 515 549 501 612 597 464 644 712 676 734 662 782 749 623 713 746 651 686 690 679 685 648 560 503 521 554 541 721 801 561 573 589 729 618 494 757 800 684 744 759 822 698 490 536 655 643 626 615 567 617 632 646 692 704 624 656 610 738 671 678 660 658 635 681 616 522 673 781 775 576 677 748 776 557 743 666 813 504 627 706 641 575 639 769 680 546 717 710 458 622 705 630 732 770 439 779 659 602 478 733 650 873 846 474 634 852 868 745 812 669 642 730 672 645 694 493 668 647 702 665 834 850 790 415 724 869 700 793 723 534 831 613 653 857 719 867 823 403 693 603 583 542 614 580 811 795 747 581 722 689 849 872 631 649 819 674 830 814 633 825 629 835 667 755 794 661 772 657 771 777 837 891 652 739 865 767 741 469 605 858 843 640 737 862 810 577 818 854 682 851 848 897 
832 791 654 856 839 725 863 808 792 696 701 871 968 750 970 877 925 977 758 884 766 894 715 783 683 842 774 797 886 892 784 687 809 917 901 887 785 900 761 806 507 948 844 798 827 670 637 619 592 943 838 817 888 890 788 588 606 608 691 711 663 731 708 609 688 636] -------------------------------------------------- Total No of count: 822 -------------------------------------------------- Value Count of count: count 5 169 4 149 3 144 6 135 2 132 Name: count, dtype: int64 --------------------------------------------------
mean =round(df["casual"].mean())
median = round(df["casual"].median())
diff = round((mean - median))
print (f"Casual User column has a significant difference between mean ({mean}) and median ({median}) and the difference({diff}), indicating potential skewness or outliers.")
Casual User column has a significant difference between mean (36) and median (17) and the difference(19), indicating potential skewness or outliers.
mean =round(df["registered"].mean())
median = round(df["registered"].median())
diff = round((mean - median))
print (f"Registered User column has a significant difference between mean ({mean}) and median ({median}) and the difference({diff}), indicating potential skewness or outliers.")
Registered User column has a significant difference between mean (156) and median (118) and the difference(38), indicating potential skewness or outliers.
mean =round(df["count"].mean())
median = round(df["count"].median())
diff = round((mean - median))
print (f" User column has a significant difference between mean ({mean}) and median ({median}) and the difference({diff}), indicating potential skewness or outliers.")
User column has a significant difference between mean (192) and median (145) and the difference(47), indicating potential skewness or outliers.
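The mean-versus-median comparison above can be backed by an explicit skewness statistic. The sketch below uses a small made-up sample (not the rental data) and pandas' `Series.skew()`; a positive value together with a mean above the median points to a right-skewed distribution.

```python
import pandas as pd

# Made-up right-skewed sample standing in for a rental-count column.
sample = pd.Series([0, 1, 2, 3, 4, 5, 8, 17, 40, 120])

mean_val = sample.mean()      # pulled upward by the large values
median_val = sample.median()  # robust to the large values
skewness = sample.skew()      # bias-adjusted sample skewness

print(f"mean={mean_val:.1f}, median={median_val:.1f}, skew={skewness:.2f}")

# Right skew: long tail of large values, mean above median.
is_right_skewed = (mean_val > median_val) and (skewness > 0)
print("Right-skewed:", is_right_skewed)
```

On the real data the same two lines (`df[col].median()` and `df[col].skew()`) could be applied to `casual`, `registered` and `count`.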
columns = ["windspeed", "casual", "registered", "count"]
colors = sns.color_palette("bright", n_colors=len(columns))
plt.figure(figsize=(15, 5))
for i, j in enumerate(columns):
    plt.subplot(1, 4, i + 1)
    # color= sets a single box colour; palette= without hue is deprecated in recent seaborn
    sns.boxplot(data=df, y=j, color=colors[i])
    plt.title(f"Outliers in {j}")
plt.suptitle("Checking for Outliers in the given Dataset", fontsize=30)
plt.tight_layout()
plt.show()
col = ["windspeed", "casual", "registered", "count"]
for i in col:
    Q1 = round(np.percentile(df[i], 25), 2)
    Q3 = round(np.percentile(df[i], 75), 2)
    IQR = Q3 - Q1
    print(f"25th Percentile of {i} is {Q1}")
    print(f"75th Percentile of {i} is {Q3}")
    print(f"IQR of {i} is {IQR}")
    Upper_bound = Q3 + (1.5 * IQR)
    Lower_bound = Q1 - (1.5 * IQR)
    print(f"Upper Bound of {i} is {Upper_bound}")
    print(f"Lower Bound of {i} is {Lower_bound}")
    # Only values above the upper bound are counted; every lower bound here is negative
    Outlier_percentage = round((len(df.loc[df[i] > Upper_bound])/len(df))*100, 2)
    print(f"Outlier Percentage of {i} is {Outlier_percentage}")
    print("*"*40)
25th Percentile of windspeed is 7.0
75th Percentile of windspeed is 17.0
IQR of windspeed is 10.0
Upper Bound of windspeed is 32.0
Lower Bound of windspeed is -8.0
Outlier Percentage of windspeed is 2.09
****************************************
25th Percentile of casual is 4.0
75th Percentile of casual is 49.0
IQR of casual is 45.0
Upper Bound of casual is 116.5
Lower Bound of casual is -63.5
Outlier Percentage of casual is 6.88
****************************************
25th Percentile of registered is 36.0
75th Percentile of registered is 222.0
IQR of registered is 186.0
Upper Bound of registered is 501.0
Lower Bound of registered is -243.0
Outlier Percentage of registered is 3.89
****************************************
25th Percentile of count is 42.0
75th Percentile of count is 284.0
IQR of count is 242.0
Upper Bound of count is 647.0
Lower Bound of count is -321.0
Outlier Percentage of count is 2.76
****************************************
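The IQR rule applied above can be wrapped in a small reusable helper. This is a sketch rather than the notebook's own code: the function name `iqr_outlier_share` and the toy list are assumptions, and unlike the loop above it counts values on both sides of the fences.

```python
import numpy as np

def iqr_outlier_share(values, k=1.5):
    """Return (lower_bound, upper_bound, percent_outside) using the IQR rule."""
    values = np.asarray(values, dtype=float)
    q1, q3 = np.percentile(values, [25, 75])
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    # Flag values beyond either fence, not only the upper one.
    outside = (values < lower) | (values > upper)
    return lower, upper, 100 * outside.mean()

# Toy data with one obvious high outlier.
demo = [1, 2, 3, 4, 5, 6, 7, 8, 9, 100]
lower, upper, pct = iqr_outlier_share(demo)
print(f"bounds=({lower}, {upper}), outliers={pct:.1f}%")
```

For non-negative columns such as `count`, the lower fence is usually negative, so the two-sided and upper-only versions give the same percentage.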
# Creating a function to give values in chart:
def add_labels(bar):
    for container in bar.containers:
        bar.bar_label(container)
columns = ["season", "workingday", "weather", "holiday"]
plt.figure(figsize=(15, 22))
for i, col in enumerate(columns, 1):
    plt.subplot(4, 2, 2*i-1)
    labels = df[col].value_counts().index
    values = df[col].value_counts().values
    plt.pie(values, labels=labels, autopct="%1.2f%%",
            startangle=90 + (45 * i), shadow=True,
            colors=sns.color_palette("tab10"))
    plt.title(f"{col.capitalize()} Distribution (Pie Chart)")
    plt.subplot(4, 2, 2*i)
    bar = sns.barplot(x=df[col].value_counts().index,
                      y=df[col].value_counts().values,
                      palette="tab10")
    add_labels(bar)
    plt.title(f"{col.capitalize()} Distribution (Bar Plot)")
plt.suptitle("Distribution of Categories", fontsize=30)
plt.tight_layout(rect=[0.0, 0.03, 1, 0.99])
plt.show()
plt.figure(figsize=(15, 22))
colors = sns.color_palette("tab10", n_colors=14)
columns = ['temp', 'atemp', 'humidity', 'windspeed', 'casual', 'registered', 'count']
for i, col in enumerate(columns, 1):
    plt.subplot(7, 2, 2*i - 1)
    sns.kdeplot(df[col], fill=True, color=colors[i])
    plt.title(f"KDE Plot of {col}")
    plt.subplot(7, 2, 2*i)
    # color= sets a single box colour; palette= without hue is deprecated in recent seaborn
    sns.boxplot(x=df[col], color=colors[i])
    plt.title(f"Boxplot of {col}")
plt.suptitle("Distribution of Numerical Columns", fontsize=30)
plt.tight_layout(rect=[0.0, 0.03, 1, 0.99])
plt.show()
plt.figure(figsize=(15, 15))
colors = sns.color_palette("Dark2", n_colors=14)
columns = ['workingday', 'holiday', 'weather', 'season']
for i, col in enumerate(columns, 1):
    plt.subplot(2, 2, i)
    sns.boxplot(x=df[col], y=df['count'], palette=colors)
    plt.title(f"Distribution of Count by {col}")
plt.suptitle("Relationships between Workingday and Count,\n Holiday and Count, Season and Count, Weather and Count", fontsize=25)
plt.tight_layout(rect=[0.0, 0.03, 1, 0.99])
plt.show()
plt.figure(figsize=(15, 15))
colors = sns.color_palette("Dark2", n_colors=14)
columns = ['temp', 'humidity', 'atemp', 'windspeed']
for i, col in enumerate(columns, 1):
    plt.subplot(2, 2, i)
    sns.histplot(df[col], kde=True, color=colors[i])
    # Each panel shows the variable's own distribution, not count
    plt.title(f"Distribution of {col}")
plt.suptitle("Distributions of continuous variables", fontsize=25)
plt.tight_layout(rect=[0.0, 0.03, 1, 0.99])
plt.show()
corr = df.corr(numeric_only = True)
corr
| | temp | atemp | humidity | windspeed | casual | registered | count |
|---|---|---|---|---|---|---|---|
| temp | 1.000000 | 0.984948 | -0.064949 | -0.017852 | 0.467097 | 0.318571 | 0.394454 |
| atemp | 0.984948 | 1.000000 | -0.043536 | -0.057473 | 0.462067 | 0.314635 | 0.389784 |
| humidity | -0.064949 | -0.043536 | 1.000000 | -0.318607 | -0.348187 | -0.265458 | -0.317371 |
| windspeed | -0.017852 | -0.057473 | -0.318607 | 1.000000 | 0.092276 | 0.091052 | 0.101369 |
| casual | 0.467097 | 0.462067 | -0.348187 | 0.092276 | 1.000000 | 0.497250 | 0.690414 |
| registered | 0.318571 | 0.314635 | -0.265458 | 0.091052 | 0.497250 | 1.000000 | 0.970948 |
| count | 0.394454 | 0.389784 | -0.317371 | 0.101369 | 0.690414 | 0.970948 | 1.000000 |
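A correlation matrix like the one above can be scanned programmatically for near-duplicate features (for example, `temp` and `atemp` at 0.98). The helper below is a sketch on made-up columns; the name `strongly_correlated_pairs` and the 0.9 threshold are assumptions chosen for illustration.

```python
import numpy as np
import pandas as pd

def strongly_correlated_pairs(frame, threshold=0.9):
    """List column pairs whose absolute Pearson correlation meets the threshold."""
    corr = frame.corr(numeric_only=True)
    # Keep only the strict upper triangle so each pair appears once
    # and the diagonal of 1.0 values is ignored.
    mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    pairs = corr.where(mask).stack()
    return pairs[pairs.abs() >= threshold]

# Made-up columns: b tracks a closely, c does not.
demo = pd.DataFrame({
    "a": [1, 2, 3, 4, 5],
    "b": [2, 4, 6, 8, 11],
    "c": [5, 1, 4, 2, 3],
})
pairs = strongly_correlated_pairs(demo)
print(pairs)
```

Pairs flagged this way are candidates for dropping one of the two columns before modelling, since they carry almost the same information.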
plt.figure(figsize = (15,10))
# A diverging colormap separates positive from negative correlations; "Dark2" is qualitative
sns.heatmap(corr, annot = True, cmap = "coolwarm", fmt=".2f", linewidths=0.5)
plt.title("Heatmap of Correlation Matrix", fontsize=16)
plt.show()
sns.pairplot(df, hue = "season", diag_kind='kde', markers='o', plot_kws={'alpha':0.5})
plt.suptitle("Pairplot of Variables Colored by Season", fontsize=30, y=1.02)
plt.show()
Null Hypothesis (H0): There is no significant difference between the booking of electric cycles on weekdays and weekends.
Alternate Hypothesis (Ha): There is a significant difference between the booking of electric cycles on weekdays and weekends.
plt.figure(figsize=(10, 10))
weekdays = df[df["workingday"] == "Yes"]["count"]
weekends = df[df["workingday"] == "No"]["count"]
plt.subplot(2, 2, 1)
sns.histplot(weekdays, kde=True, color=sns.color_palette("Dark2")[4]).lines[0].set_color("red")
plt.title("Distribution of Electric Cycles Rented on Weekdays")
plt.subplot(2, 2, 3)
sns.histplot(weekends, kde=True, color=sns.color_palette("Dark2")[1]).lines[0].set_color("blue")
plt.title("Distribution of Electric Cycles Rented on Weekends")
# QQ Plot
plt.subplot(2, 2, 2)
#qqplot(weekdays , line = 's')
probplot(weekdays, dist="norm", plot=plt)
plt.title("Q-Q Plot for Weekdays Data")
plt.subplot(2, 2, 4)
#qqplot(weekends , line = 's')
probplot(weekends, dist="norm", plot=plt)
plt.title("Q-Q Plot for Weekends Data")
plt.suptitle("Checking Normality Distribution for Weekdays vs Weekends", fontsize=25)
plt.tight_layout(rect=[0.0, 0.03, 1, 0.99])
plt.show()
Insights :
print("*"*50)
print("Week Day Shapiro-Wilk test distribution")
print("*"*50)
stat,p_value = shapiro(weekdays.sample(1000))
print(f"Test Statistics : {stat}")
print(f"P_value : {p_value}")
alpha = 0.05
if p_value < alpha:
    print("Weekday distribution does not follow normal distribution")
else:
    print("Weekday distribution follow normal distribution")
**************************************************
Week Day Shapiro-Wilk test distribution
**************************************************
Test Statistics : 0.8707542419433594
P_value : 5.880764010980746e-28
Weekday distribution does not follow normal distribution
print("*"*50)
print("Week End Shapiro-Wilk test distribution")
print("*"*50)
stat,p_value = shapiro(weekends.sample(1000))
print(f"Test Statistics : {stat}")
print(f"P_value : {p_value}")
alpha = 0.05
if p_value < alpha:
    print("Weekend distribution does not follow normal distribution")
else:
    print("Weekend distribution follow normal distribution")
**************************************************
Week End Shapiro-Wilk test distribution
**************************************************
Test Statistics : 0.8797138929367065
P_value : 4.089346310216862e-27
Weekend distribution does not follow normal distribution
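Because the Shapiro-Wilk tests above run on a random subsample, the statistics change on every run. A minimal sketch of making the subsample reproducible is shown below; it uses synthetic skewed data rather than the rental counts, and the seed values are arbitrary assumptions.

```python
import numpy as np
import pandas as pd
from scipy.stats import shapiro

# Synthetic right-skewed data standing in for rental counts.
rng = np.random.default_rng(0)
counts = pd.Series(rng.exponential(scale=190, size=5000))

# Fixing random_state makes the subsample, and so the test result, repeatable.
subset = counts.sample(1000, random_state=42)
stat, p_value = shapiro(subset)

# Drawing again with the same seed yields the same rows and the same statistic.
stat_again, p_again = shapiro(counts.sample(1000, random_state=42))

print(f"W={stat:.4f}, p={p_value:.3e}, repeatable={stat == stat_again}")
```

Subsampling is a common workaround because SciPy's documentation warns that Shapiro-Wilk p-values may be inaccurate for more than 5000 observations.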
Insights :
print("*"*50)
print("Weekday Vs Weekend Levene's Test")
print("*"*50)
stat,p_value = levene(weekdays,weekends)
print(f"Test Statistics : {stat}")
print(f"P_value : {p_value}")
if p_value < alpha:
    print("Both Groups don't have equal variance")
else:
    print("Both Groups have equal variance")
**************************************************
Weekday Vs Weekend Levene's Test
**************************************************
Test Statistics : 0.004972848886504472
P_value : 0.9437823280916695
Both Groups have equal variance
Insights :
print("*"*50)
print("Two-Sample Independent t-test")
print("*"*50)
t_stat, p_value = ttest_ind(weekdays,weekends)
print(f"t_statistics: {t_stat}")
print(f"P_value: {p_value}")
# Assume Significance Level as 5%
alpha = 0.05
if p_value < alpha:
    print("Reject Null Hypothesis")
    print("There is Significant difference between booking of electric cycles on Weekdays and Weekends.")
else:
    print("Failed to Reject Null Hypothesis")
    print("There is No Significant difference between booking of electric cycles on Weekdays and Weekends.")
**************************************************
Two-Sample Independent t-test
**************************************************
t_statistics: 1.2096277376026694
P_value: 0.22644804226361348
Failed to Reject Null Hypothesis
There is No Significant difference between booking of electric cycles on Weekdays and Weekends.
Based on the results of the two-sample independent t-test, there is no significant difference between the number of electric cycle rentals on weekdays and weekends. Specifically, the t-statistic is 1.21 and the p-value is 0.226, which is greater than the 5% significance level.
Therefore, we fail to reject the null hypothesis.
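Since the Shapiro-Wilk tests rejected normality for both groups, a rank-based test is a reasonable cross-check on the t-test conclusion. The sketch below is not part of the original analysis: it runs a Mann-Whitney U test and Welch's t-test on synthetic skewed samples, and the group sizes and seed are arbitrary assumptions.

```python
import numpy as np
from scipy.stats import mannwhitneyu, ttest_ind

# Synthetic skewed samples of unequal size (not the rental data).
rng = np.random.default_rng(1)
group_a = rng.exponential(scale=190, size=2000)
group_b = rng.exponential(scale=190, size=1000)

# Mann-Whitney U compares the two distributions using ranks,
# so it does not rely on normality.
u_stat, u_p = mannwhitneyu(group_a, group_b, alternative="two-sided")

# Welch's t-test drops the equal-variance assumption of the pooled t-test.
t_stat, welch_p = ttest_ind(group_a, group_b, equal_var=False)

alpha = 0.05
for name, p in [("Mann-Whitney U", u_p), ("Welch t-test", welch_p)]:
    decision = "Reject" if p < alpha else "Fail to reject"
    print(f"{name}: p={p:.4f} -> {decision} H0")
```

If both tests agree with the pooled t-test, the conclusion is less likely to be an artifact of the violated normality assumption.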
Null Hypothesis (H0): There is no significant difference in the number of cycles rented on regular days and holidays.
Alternate Hypothesis (Ha): There is a significant difference in the average number of cycles rented on regular days and holidays.
plt.figure(figsize=(10, 10))
# holiday == "No" marks regular days; holiday == "Yes" marks holidays
regulardays = df[df["holiday"] == "No"]["count"]
holidays = df[df["holiday"] == "Yes"]["count"]
plt.subplot(2, 2, 1)
sns.histplot(regulardays, kde=True, color=sns.color_palette("Dark2")[4]).lines[0].set_color("red")
plt.title("Distribution of Electric Cycles Rented on Regulardays")
plt.subplot(2, 2, 3)
sns.histplot(holidays, kde=True, color=sns.color_palette("Dark2")[1]).lines[0].set_color("blue")
plt.title("Distribution of Electric Cycles Rented on Holidays")
# QQ Plot
plt.subplot(2, 2, 2)
#qqplot(weekdays , line = 's')
probplot(regulardays, dist="norm", plot=plt)
plt.title("Q-Q Plot for Regulardays Data")
plt.subplot(2, 2, 4)
#qqplot(weekends , line = 's')
probplot(holidays, dist="norm", plot=plt)
plt.title("Q-Q Plot for Holidays Data")
plt.suptitle("Checking Normality Distribution for Regulardays vs Holidays", fontsize=25)
plt.tight_layout(rect=[0.0, 0.03, 1, 0.99])
plt.show()
Insights :
print("*"*50)
print("Regular Days Shapiro-Wilk test distribution")
print("*"*50)
stat,p_value = shapiro(regulardays.sample(300))
print(f"Test Statistics : {stat}")
print(f"P_value : {p_value}")
alpha = 0.05
if p_value < alpha:
    print("RegularDays distribution does not follow normal distribution")
else:
    print("RegularDays distribution follow normal distribution")
**************************************************
Regular Days Shapiro-Wilk test distribution
**************************************************
Test Statistics : 0.8966464400291443
P_value : 1.9672651908105715e-13
RegularDays distribution does not follow normal distribution
print("*"*50)
print("Holidays Shapiro-Wilk test distribution")
print("*"*50)
stat,p_value = shapiro(holidays.sample(300))
print(f"Test Statistics : {stat}")
print(f"P_value : {p_value}")
alpha = 0.05
if p_value < alpha:
    print("Holidays distribution does not follow normal distribution")
else:
    print("Holidays distribution follow normal distribution")
**************************************************
Holidays Shapiro-Wilk test distribution
**************************************************
Test Statistics : 0.8657498955726624
P_value : 1.748727365814644e-15
Holidays distribution does not follow normal distribution
Insights :
print("*"*50)
print("Regulardays Vs Holidays Levene's Test")
print("*"*50)
stat,p_value = levene(regulardays,holidays)
print(f"Test Statistics : {stat}")
print(f"P_value : {p_value}")
if p_value < alpha:
    print("Both Groups don't have equal variance")
else:
    print("Both Groups have equal variance")
**************************************************
Regulardays Vs Holidays Levene's Test
**************************************************
Test Statistics : 1.222306875221986e-06
P_value : 0.9991178954732041
Both Groups have equal variance
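Insights :
The p-value (0.999) is well above 0.05, so we fail to reject the hypothesis that regular days and holidays have equal variance.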
print("*"*50)
print("Two-Sample Independent t-test for Regulardays vs Holidays")
print("*"*50)
t_stat, p_value = ttest_ind(regulardays,holidays)
print(f"t_statistics: {t_stat}")
print(f"P_value: {p_value}")
# Assume Significance Level as 5%
alpha = 0.05
if p_value < alpha:
    print("Reject Null Hypothesis")
    print("There is Significant difference between booking of electric cycles on Regulardays and Holidays.")
else:
    print("Failed to Reject Null Hypothesis")
    print("There is No Significant difference between booking of electric cycles on Regulardays and Holidays.")
**************************************************
Two-Sample Independent t-test for Regulardays vs Holidays
**************************************************
t_statistics: -0.5626388963477119
P_value: 0.5736923883271103
Failed to Reject Null Hypothesis
There is No Significant difference between booking of electric cycles on Regulardays and Holidays.
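Result for Two-Sample Independent t-test for Regulardays vs Holidays: the p-value (0.574) is greater than 0.05, so we fail to reject the null hypothesis. There is no statistically significant difference between bookings on regular days and holidays.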
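Both groups failed the Shapiro-Wilk check above, so a rank-based test can serve as a cross-check on the t-test. The following is a minimal sketch, not part of the original analysis: `mann_whitney_check`, `regulardays_demo`, and `holidays_demo` are hypothetical names, and the data is synthetic, standing in for the real `regulardays` and `holidays` series.

```python
import numpy as np
from scipy.stats import mannwhitneyu


def mann_whitney_check(group_a, group_b, alpha=0.05):
    """Two-sided Mann-Whitney U test; returns (statistic, p_value, reject_null)."""
    stat, p_value = mannwhitneyu(group_a, group_b, alternative="two-sided")
    return stat, p_value, bool(p_value < alpha)


# Synthetic, right-skewed stand-ins for the rental counts (illustration only).
rng = np.random.default_rng(0)
regulardays_demo = rng.gamma(shape=1.2, scale=160, size=2000)
holidays_demo = rng.gamma(shape=1.2, scale=160, size=300)

stat, p_value, reject = mann_whitney_check(regulardays_demo, holidays_demo)
print(f"U statistic: {stat}")
print(f"P_value: {p_value}")
print("Reject Null Hypothesis" if reject else "Failed to Reject Null Hypothesis")
```

With the real data, the call would be `mann_whitney_check(regulardays, holidays)`. If its decision matches the t-test, the conclusion does not hinge on the normality assumption.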
Null Hypothesis (H0): There is no significant difference in the average number of cycles rented across the different seasons.
Alternate Hypothesis (Ha): At least one season has a significantly different average number of cycles rented compared to the others.
plt.figure(figsize=(10, 15))
spring = df[df["season"]== "Spring"]["count"]
summer = df[df["season"]== "Summer"]["count"]
fall = df[df["season"]== "Fall"]["count"]
winter = df[df["season"]== "Winter"]["count"]
plt.subplot(4, 2, 1)
sns.histplot(spring, kde=True, color=sns.color_palette("Dark2")[1]).lines[0].set_color("violet")
plt.title("Distribution of Electric Cycles Rented during spring")
plt.subplot(4, 2, 3)
sns.histplot(summer, kde=True, color=sns.color_palette("Dark2")[2]).lines[0].set_color("red")
plt.title("Distribution of Electric Cycles Rented during summer")
plt.subplot(4, 2, 5)
sns.histplot(fall, kde=True, color=sns.color_palette("Dark2")[3]).lines[0].set_color("green")
plt.title("Distribution of Electric Cycles Rented during fall")
plt.subplot(4, 2, 7)
sns.histplot(winter, kde=True, color=sns.color_palette("Dark2")[4]).lines[0].set_color("orange")
plt.title("Distribution of Electric Cycles Rented during winter")
# QQ Plot
plt.subplot(4, 2, 2)
probplot(spring, dist="norm", plot=plt)
plt.title("Q-Q Plot for spring Data")
plt.subplot(4, 2, 4)
probplot(summer, dist="norm", plot=plt)
plt.title("Q-Q Plot for summer Data")
plt.subplot(4, 2, 6)
probplot(fall, dist="norm", plot=plt)
plt.title("Q-Q Plot for fall Data")
plt.subplot(4, 2, 8)
probplot(winter, dist="norm", plot=plt)
plt.title("Q-Q Plot for winter Data")
plt.suptitle("Checking Normality Distribution of \n Electric Cycles Rented during different Seasons", fontsize=25)
plt.tight_layout(rect=[0.0, 0.03, 1, 0.99])
plt.show()
Insights :
print("*"*50)
print("Spring Season Shapiro-Wilk test distribution")
print("*"*50)
stat,p_value = shapiro(spring.sample(1000))
print(f"Test Statistics : {stat}")
print(f"P_value : {p_value}")
alpha = 0.05
if p_value < alpha:
    print("Spring Season distribution does not follow normal distribution")
else:
    print("Spring Season distribution follow normal distribution")
**************************************************
Spring Season Shapiro-Wilk test distribution
**************************************************
Test Statistics : 0.8035904169082642
P_value : 3.9111643134574254e-33
Spring Season distribution does not follow normal distribution
print("*"*50)
print("Summer Season Shapiro-Wilk test distribution")
print("*"*50)
stat,p_value = shapiro(summer.sample(1000))
print(f"Test Statistics : {stat}")
print(f"P_value : {p_value}")
alpha = 0.05
if p_value < alpha:
    print("Summer Season distribution does not follow normal distribution")
else:
    print("Summer Season distribution follow normal distribution")
**************************************************
Summer Season Shapiro-Wilk test distribution
**************************************************
Test Statistics : 0.9018007516860962
P_value : 8.248538180092717e-25
Summer Season distribution does not follow normal distribution
print("*"*50)
print("Fall Season Shapiro-Wilk test distribution")
print("*"*50)
stat,p_value = shapiro(fall.sample(1000))
print(f"Test Statistics : {stat}")
print(f"P_value : {p_value}")
alpha = 0.05
if p_value < alpha:
    print("Fall Season distribution does not follow normal distribution")
else:
    print("Fall Season distribution follow normal distribution")
**************************************************
Fall Season Shapiro-Wilk test distribution
**************************************************
Test Statistics : 0.9141510725021362
P_value : 2.418179740890826e-23
Fall Season distribution does not follow normal distribution
print("*"*50)
print("Winter Season Shapiro-Wilk test distribution")
print("*"*50)
stat,p_value = shapiro(winter.sample(1000))
print(f"Test Statistics : {stat}")
print(f"P_value : {p_value}")
alpha = 0.05
if p_value < alpha:
    print("Winter Season distribution does not follow normal distribution")
else:
    print("Winter Season distribution follow normal distribution")
**************************************************
Winter Season Shapiro-Wilk test distribution
**************************************************
Test Statistics : 0.8978354930877686
P_value : 2.986761986558704e-25
Winter Season distribution does not follow normal distribution
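The four Shapiro-Wilk cells above differ only in the series they test, and each draws a fresh random sample, so the printed statistics change from run to run. A possible consolidation is sketched below. The helper `shapiro_by_group`, its `seed` parameter, and the synthetic `demo_groups` are assumptions made for illustration; they are not part of the original notebook.

```python
import numpy as np
import pandas as pd
from scipy.stats import shapiro


def shapiro_by_group(groups, sample_size=1000, alpha=0.05, seed=42):
    """Run Shapiro-Wilk on a fixed-seed sample of each group.

    groups: dict mapping a label to a pandas Series of counts.
    Returns a dict mapping label -> (statistic, p_value, looks_normal).
    """
    results = {}
    for label, values in groups.items():
        n = min(sample_size, len(values))  # never request more rows than exist
        sample = values.sample(n, random_state=seed)  # fixed seed for repeatability
        stat, p_value = shapiro(sample)
        results[label] = (stat, p_value, bool(p_value >= alpha))
    return results


# Synthetic, right-skewed stand-ins for the seasonal series (illustration only).
rng = np.random.default_rng(1)
demo_groups = {
    "Spring": pd.Series(rng.gamma(1.0, 110, size=2600)),
    "Summer": pd.Series(rng.gamma(1.3, 165, size=2700)),
}

results = shapiro_by_group(demo_groups)
for label, (stat, p_value, looks_normal) in results.items():
    verdict = "follows" if looks_normal else "does not follow"
    print(f"{label}: W={stat:.4f}, p={p_value:.3e} -> {verdict} a normal distribution")
```

On the real data the dictionary would be `{"Spring": spring, "Summer": summer, "Fall": fall, "Winter": winter}`.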
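Insights :
All four seasonal samples have p-values far below 0.05, so none of the seasonal distributions is normal.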
print("*"*50)
print("Levene's Test For All Four Seasons")
print("*"*50)
stat,p_value = levene(spring, summer, fall, winter)
print(f"Test Statistics : {stat}")
print(f"P_value : {p_value}")
if p_value < alpha:
    print("All Four Seasons don't have equal variance")
else:
    print("All Four Seasons have equal variance")
**************************************************
Levene's Test For All Four Seasons
**************************************************
Test Statistics : 187.7706624026276
P_value : 1.0147116860043298e-118
All Four Seasons don't have equal variance
Insights :
Two of the three assumptions for ANOVA (normality and equal variance) are not met, but we will still proceed with the ANOVA test.
We will also conduct the Kruskal-Wallis test for comparison.
If there are any discrepancies between the results, we will rely on the Kruskal-Wallis test, as the data does not fully meet the assumptions required for ANOVA.
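The decision rule described above (run both tests, and let the rank-based one decide when they disagree) can be sketched as a small helper. The function `compare_anova_kruskal` and the synthetic groups are hypothetical and only illustrate the rule.

```python
import numpy as np
from scipy.stats import f_oneway, kruskal


def compare_anova_kruskal(groups, alpha=0.05):
    """Run one-way ANOVA and Kruskal-Wallis on the same groups.

    Returns both p-values, whether the two decisions agree, and the
    final decision, which follows Kruskal-Wallis.
    """
    _, p_anova = f_oneway(*groups)
    _, p_kruskal = kruskal(*groups)
    reject_anova = bool(p_anova < alpha)
    reject_kruskal = bool(p_kruskal < alpha)
    return {
        "p_anova": p_anova,
        "p_kruskal": p_kruskal,
        "tests_agree": reject_anova == reject_kruskal,
        "reject_null": reject_kruskal,  # Kruskal-Wallis has the final say
    }


# Synthetic, right-skewed groups with clearly different scales (illustration only).
rng = np.random.default_rng(2)
demo = [rng.gamma(1.2, scale, size=500) for scale in (100, 180, 200, 170)]

summary = compare_anova_kruskal(demo)
print(summary)
```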
Null Hypothesis (H0): There is no significant difference in the average number of cycles rented across the different seasons.
Alternate Hypothesis (Ha): At least one season has a significantly different average number of cycles rented compared to the others.
print("*"*50)
print("ANOVA Test")
print("*"*50)
stat,p_value = f_oneway(spring,summer,fall,winter)
print(f"F_statistic: {stat}")
print(f"P_value: {p_value}")
# Assume Significance Level as 5%
alpha = 0.05
if p_value < alpha:
    print("Reject Null Hypothesis")
    print("There is a Significantly different average number of cycles rented compared to the other seasons.")
else:
    print("Failed to Reject Null Hypothesis")
    print("There is No significant difference in the number of cycles rented across the different seasons.")
**************************************************
ANOVA Test
**************************************************
F_statistic: 236.94671081032106
P_value: 6.164843386499654e-149
Reject Null Hypothesis
There is a Significantly different average number of cycles rented compared to the other seasons.
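Insights :
ANOVA Test: the p-value (6.16e-149) is far below 0.05, so we reject the null hypothesis. At least one season has a significantly different average number of cycles rented.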
print("*"*50)
print("Kruskal-Wallis Test")
print("*"*50)
stat,p_value = kruskal(spring,summer,fall,winter)
print(f"H_statistic: {stat}")
print(f"P_value: {p_value}")
# Assume Significance Level as 5%
alpha = 0.05
if p_value < alpha:
    print("Reject Null Hypothesis")
    print("There is a Significantly different average number of cycles rented compared to the other seasons.")
else:
    print("Failed to Reject Null Hypothesis")
    print("There is No significant difference in the number of cycles rented across the different seasons.")
**************************************************
Kruskal-Wallis Test
**************************************************
H_statistic: 699.6668548181988
P_value: 2.479008372608633e-151
Reject Null Hypothesis
There is a Significantly different average number of cycles rented compared to the other seasons.
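Insights :
Kruskal-Wallis Test: the p-value (2.48e-151) is far below 0.05, so we reject the null hypothesis, in agreement with ANOVA.
To find which seasons differ, we compare them pairwise. With four seasons there are 6 possible pairs to compare: Spring vs Summer, Spring vs Fall, Spring vs Winter, Summer vs Fall, Summer vs Winter, and Fall vs Winter.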
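The count of six follows from choosing 2 seasons out of 4. As a small illustration, the pairs can be generated rather than typed out; the variable names here are only for the example.

```python
from itertools import combinations

seasons = ["Spring", "Summer", "Fall", "Winter"]

# Every unordered pair of distinct seasons, in the order the list is given.
season_pairs = list(combinations(seasons, 2))

for first, second in season_pairs:
    print(f"{first} vs {second}")
print(f"Number of pairs: {len(season_pairs)}")
```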
print("*"*50)
print("Two-Sample Independent t-test for Spring vs Summer")
print("*"*50)
t_stat, p_value = ttest_ind(spring,summer)
print(f"t_statistics: {t_stat}")
print(f"P_value: {p_value}")
# Assume Significance Level as 5%
alpha = 0.05
if p_value < alpha:
    print("Reject Null Hypothesis")
    print("There is a Significantly different average number of cycles rented in Spring vs Summer.")
else:
    print("Failed to Reject Null Hypothesis")
    print("There is No significant difference in the number of cycles rented across Spring vs Summer.")
**************************************************
Two-Sample Independent t-test for Spring vs Summer
**************************************************
t_statistics: -22.41673852194779
P_value: 1.6578587340400095e-106
Reject Null Hypothesis
There is a Significantly different average number of cycles rented in Spring vs Summer.
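Insights :
Spring vs Summer: the p-value (1.66e-106) is far below 0.05, so we reject the null hypothesis. The negative t-statistic (-22.42) indicates that average rentals are lower in spring than in summer.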
print("*"*50)
print("Two-Sample Independent t-test for Spring vs Fall")
print("*"*50)
t_stat, p_value = ttest_ind(spring,fall)
print(f"t_statistics: {t_stat}")
print(f"P_value: {p_value}")
# Assume Significance Level as 5%
alpha = 0.05
if p_value < alpha:
    print("Reject Null Hypothesis")
    print("There is a Significantly different average number of cycles rented in Spring vs Fall.")
else:
    print("Failed to Reject Null Hypothesis")
    print("There is No significant difference in the number of cycles rented across Spring vs Fall.")
**************************************************
Two-Sample Independent t-test for Spring vs Fall
**************************************************
t_statistics: -26.262602569974415
P_value: 3.403850435531097e-143
Reject Null Hypothesis
There is a Significantly different average number of cycles rented in Spring vs Fall.
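Insights :
Spring vs Fall: the p-value (3.40e-143) is far below 0.05, so we reject the null hypothesis. The negative t-statistic (-26.26) indicates that average rentals are lower in spring than in fall.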
print("*"*50)
print("Two-Sample Independent t-test for Spring vs Winter")
print("*"*50)
t_stat, p_value = ttest_ind(spring,winter)
print(f"t_statistics: {t_stat}")
print(f"P_value: {p_value}")
# Assume Significance Level as 5%
alpha = 0.05
if p_value < alpha:
    print("Reject Null Hypothesis")
    print("There is a Significantly different average number of cycles rented in Spring vs Winter.")
else:
    print("Failed to Reject Null Hypothesis")
    print("There is No significant difference in the number of cycles rented across Spring vs Winter.")
**************************************************
Two-Sample Independent t-test for Spring vs Winter
**************************************************
t_statistics: -19.763761227758852
P_value: 5.236417429066782e-84
Reject Null Hypothesis
There is a Significantly different average number of cycles rented in Spring vs Winter.
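Insights :
Spring vs Winter: the p-value (5.24e-84) is far below 0.05, so we reject the null hypothesis. The negative t-statistic (-19.76) indicates that average rentals are lower in spring than in winter.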
print("*"*50)
print("Two-Sample Independent t-test for Summer vs Fall")
print("*"*50)
t_stat, p_value = ttest_ind(summer,fall)
print(f"t_statistics: {t_stat}")
print(f"P_value: {p_value}")
# Assume Significance Level as 5%
alpha = 0.05
if p_value < alpha:
    print("Reject Null Hypothesis")
    print("There is a Significantly different average number of cycles rented in Summer vs Fall.")
else:
    print("Failed to Reject Null Hypothesis")
    print("There is No significant difference in the number of cycles rented across Summer vs Fall.")
**************************************************
Two-Sample Independent t-test for Summer vs Fall
**************************************************
t_statistics: -3.6407918229052068
P_value: 0.00027431561172498644
Reject Null Hypothesis
There is a Significantly different average number of cycles rented in Summer vs Fall.
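Insights :
Summer vs Fall: the p-value (0.00027) is below 0.05, so we reject the null hypothesis. The negative t-statistic (-3.64) indicates that average rentals are lower in summer than in fall, although the gap is much smaller than in the spring comparisons.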
#### **5. Summer VS Winter**
print("*"*50)
print("Two-Sample Independent t-test for Summer vs Winter")
print("*"*50)
t_stat, p_value = ttest_ind(summer,winter)
print(f"t_statistics: {t_stat}")
print(f"P_value: {p_value}")
# Assume Significance Level as 5%
alpha = 0.05
if p_value < alpha:
    print("Reject Null Hypothesis")
    print("There is a Significantly different average number of cycles rented in Summer vs Winter.")
else:
    print("Failed to Reject Null Hypothesis")
    print("There is No significant difference in the number of cycles rented across Summer vs Winter.")
**************************************************
Two-Sample Independent t-test for Summer vs Winter
**************************************************
t_statistics: 3.2507544346007022
P_value: 0.001157968169413171
Reject Null Hypothesis
There is a Significantly different average number of cycles rented in Summer vs Winter.
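Insights :
Summer vs Winter: the p-value (0.0012) is below 0.05, so we reject the null hypothesis. The positive t-statistic (3.25) indicates that average rentals are higher in summer than in winter.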
print("*"*50)
print("Two-Sample Independent t-test for Fall vs Winter")
print("*"*50)
t_stat, p_value = ttest_ind(fall,winter)
print(f"t_statistics: {t_stat}")
print(f"P_value: {p_value}")
# Assume Significance Level as 5%
alpha = 0.05
if p_value < alpha:
    print("Reject Null Hypothesis")
    print("There is a Significantly different average number of cycles rented in Fall vs Winter.")
else:
    print("Failed to Reject Null Hypothesis")
    print("There is No significant difference in the number of cycles rented across Fall vs Winter.")
**************************************************
Two-Sample Independent t-test for Fall vs Winter
**************************************************
t_statistics: 6.980360925184712
P_value: 3.294359667247495e-12
Reject Null Hypothesis
There is a Significantly different average number of cycles rented in Fall vs Winter.
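Insights :
Fall vs Winter: the p-value (3.29e-12) is far below 0.05, so we reject the null hypothesis. The positive t-statistic (6.98) indicates that average rentals are higher in fall than in winter.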
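Two caveats apply to the six pairwise tests above. Levene's test rejected equal variances across seasons, while `ttest_ind` assumes equal variances by default, and running six tests at a 5% level each raises the chance of at least one false positive. One possible adjustment, sketched here under assumptions, uses Welch's t-test (`equal_var=False`) with a Bonferroni-corrected threshold. The helper `pairwise_welch` and the synthetic data are hypothetical; this is not the analysis that produced the outputs above.

```python
from itertools import combinations

import numpy as np
from scipy.stats import ttest_ind


def pairwise_welch(groups, alpha=0.05):
    """Welch's t-test for every pair of groups with a Bonferroni-adjusted alpha.

    groups: dict mapping label -> 1-D array of observations.
    Returns a list of (label_a, label_b, t_stat, p_value, reject_null).
    """
    pairs = list(combinations(groups, 2))
    adjusted_alpha = alpha / len(pairs)  # Bonferroni: split alpha across all tests
    results = []
    for a, b in pairs:
        # equal_var=False selects Welch's test, which allows unequal variances.
        t_stat, p_value = ttest_ind(groups[a], groups[b], equal_var=False)
        results.append((a, b, t_stat, p_value, bool(p_value < adjusted_alpha)))
    return results


# Synthetic, right-skewed stand-ins for the seasonal counts (illustration only).
rng = np.random.default_rng(4)
season_demo = {
    "Spring": rng.gamma(1.0, 115, size=2600),
    "Summer": rng.gamma(1.3, 165, size=2700),
    "Fall": rng.gamma(1.4, 167, size=2700),
    "Winter": rng.gamma(1.3, 153, size=2700),
}

pair_results = pairwise_welch(season_demo)
for a, b, t_stat, p_value, reject in pair_results:
    decision = "Reject Null Hypothesis" if reject else "Failed to Reject Null Hypothesis"
    print(f"{a} vs {b}: t={t_stat:.3f}, p={p_value:.3e} -> {decision}")
```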
Null Hypothesis (H0): There is no significant difference in the average number of cycles rented across the different weather conditions.
Alternate Hypothesis (Ha): At least one weather condition has a significantly different average number of cycles rented compared to the others.
plt.figure(figsize=(10, 15))
clear = df[df["weather"]== 1]["count"]
mist = df[df["weather"]== 2]["count"]
rain = df[df["weather"]== 3]["count"]
heavy_rain = df[df["weather"]== 4]["count"]
plt.subplot(4, 2, 1)
sns.histplot(clear, kde=True, color=sns.color_palette("Dark2")[1]).lines[0].set_color("red")
plt.title("Distribution of Electric Cycles Rented during Clear Weather Conditions")
plt.subplot(4, 2, 3)
sns.histplot(mist, kde=True, color=sns.color_palette("Dark2")[2]).lines[0].set_color("red")
plt.title("Distribution of Electric Cycles Rented during Mist Weather Conditions")
plt.subplot(4, 2, 5)
sns.histplot(rain, kde=True, color=sns.color_palette("Dark2")[3]).lines[0].set_color("green")
plt.title("Distribution of Electric Cycles Rented during Rain Weather Conditions")
plt.subplot(4, 2, 7)
sns.histplot(heavy_rain, kde=True, color=sns.color_palette("Dark2")[4])#.lines[0].set_color("orange")
plt.title("Distribution of Electric Cycles Rented during Heavy_Rain Weather Conditions")
# QQ Plot
plt.subplot(4, 2, 2)
probplot(clear, dist="norm", plot=plt)
plt.title("Q-Q Plot for Clear Weather Conditions")
plt.subplot(4, 2, 4)
probplot(mist, dist="norm", plot=plt)
plt.title("Q-Q Plot for Mist Weather Conditions")
plt.subplot(4, 2, 6)
probplot(rain, dist="norm", plot=plt)
plt.title("Q-Q Plot for Rain Weather Conditions")
plt.subplot(4, 2, 8)
probplot(heavy_rain, dist="norm", plot=plt)
plt.title("Q-Q Plot for Heavy_Rain Weather Conditions")
plt.suptitle("Checking Normality Distribution of \n Electric Cycles Rented during different Weather Conditions", fontsize=25)
plt.tight_layout(rect=[0.0, 0.03, 1, 0.99])
plt.show()
Insights :
print("*"*50)
print("Clear Weather Conditions Shapiro-Wilk test distribution")
print("*"*50)
stat,p_value = shapiro(clear.sample(1000))
print(f"Test Statistics : {stat}")
print(f"P_value : {p_value}")
alpha = 0.05
if p_value < alpha:
    print("Clear Weather Conditions distribution does not follow normal distribution")
else:
    print("Clear Weather Conditions distribution follow normal distribution")
**************************************************
Clear Weather Conditions Shapiro-Wilk test distribution
**************************************************
Test Statistics : 0.8943600654602051
P_value : 1.2564953079052152e-25
Clear Weather Conditions distribution does not follow normal distribution
print("*"*50)
print("Mist Weather Conditions Shapiro-Wilk test distribution")
print("*"*50)
stat,p_value = shapiro(mist.sample(1000))
print(f"Test Statistics : {stat}")
print(f"P_value : {p_value}")
alpha = 0.05
if p_value < alpha:
    print("Mist Weather Conditions distribution does not follow normal distribution")
else:
    print("Mist Weather Conditions distribution follow normal distribution")
**************************************************
Mist Weather Conditions Shapiro-Wilk test distribution
**************************************************
Test Statistics : 0.8771512508392334
P_value : 2.3223832012183317e-27
Mist Weather Conditions distribution does not follow normal distribution
print("*"*50)
print("Rain Weather Conditions Shapiro-Wilk test distribution")
print("*"*50)
stat,p_value = shapiro(rain.sample(100))
print("Due To not enough data to process we go with 100 samples for this test")
print(f"Test Statistics : {stat}")
print(f"P_value : {p_value}")
alpha = 0.05
if p_value < alpha:
    print("Rain Weather Conditions distribution does not follow normal distribution")
else:
    print("Rain Weather Conditions distribution follow normal distribution")
**************************************************
Rain Weather Conditions Shapiro-Wilk test distribution
**************************************************
Due To not enough data to process we go with 100 samples for this test
Test Statistics : 0.7801065444946289
P_value : 6.339263142196572e-11
Rain Weather Conditions distribution does not follow normal distribution
print("*"*50)
print("Heavy_Rain Weather Conditions Shapiro-Wilk test distribution")
print("*"*50)
#stat,p_value = shapiro(heavy_rain.sample(1))
#print(f"Test Statistics : {stat}")
#print(f"P_value : {p_value}")
#alpha = 0.05
#if p_value < alpha:
#    print("Heavy_Rain Weather Conditions distribution does not follow normal distribution")
#else:
#    print("Heavy_Rain Weather Conditions distribution follow normal distribution")
print("Due To not enough data to process we cannot process this test")
**************************************************
Heavy_Rain Weather Conditions Shapiro-Wilk test distribution
**************************************************
Due To not enough data to process we cannot process this test
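Rather than commenting the test out, the size check can be made explicit, since the Shapiro-Wilk test needs at least three observations. A minimal sketch follows; `safe_shapiro` is a hypothetical helper and `heavy_rain_demo` holds a made-up single value, not the real record.

```python
import numpy as np
from scipy.stats import shapiro


def safe_shapiro(values, min_n=3):
    """Return (statistic, p_value), or None when fewer than min_n observations exist."""
    values = np.asarray(values, dtype=float)
    if values.size < min_n:
        return None  # too few points for Shapiro-Wilk
    stat, p_value = shapiro(values)
    return stat, p_value


heavy_rain_demo = [150]  # hypothetical single observation
result = safe_shapiro(heavy_rain_demo)
if result is None:
    print("Not enough data to run the Shapiro-Wilk test")
else:
    print(f"Test Statistics : {result[0]}")
    print(f"P_value : {result[1]}")
```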
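Insights :
The Clear, Mist, and Rain samples all have p-values far below 0.05, so none of them follows a normal distribution. Heavy Rain has too few observations to be tested.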
print("*"*50)
print("Levene's Test For All Four Weather Conditions")
print("*"*50)
stat,p_value = levene(clear, mist, rain, heavy_rain)
print(f"Test Statistics : {stat}")
print(f"P_value : {p_value}")
if p_value < alpha:
    print("All Four Weather Conditions don't have equal variance")
else:
    print("All Four Weather Conditions have equal variance")
**************************************************
Levene's Test For All Four Weather Conditions
**************************************************
Test Statistics : 54.85106195954556
P_value : 3.504937946833238e-35
All Four Weather Conditions don't have equal variance
Insights :
Two of the three assumptions for ANOVA (normality and equal variance) are not met, but we will still proceed with the ANOVA test.
We will also conduct the Kruskal-Wallis test for comparison.
If there are any discrepancies between the results, we will rely on the Kruskal-Wallis test, as the data does not fully meet the assumptions required for ANOVA.
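The Heavy Rain group contains a single observation, so it carries no information about within-group spread. One option, shown here only as a sketch, is to filter out groups below a minimum size before running Levene's test and the Kruskal-Wallis test. The helper `drop_small_groups`, the `min_n` threshold, and the synthetic data are assumptions for illustration; the tests in this notebook were run on all four groups.

```python
import numpy as np
from scipy.stats import kruskal, levene


def drop_small_groups(groups, min_n=2):
    """Keep only the groups that have at least min_n observations."""
    return {label: values for label, values in groups.items() if len(values) >= min_n}


# Synthetic stand-ins for the weather groups (illustration only).
rng = np.random.default_rng(3)
weather_demo = {
    "Clear": rng.gamma(1.3, 160, size=700),
    "Mist": rng.gamma(1.3, 140, size=280),
    "Rain": rng.gamma(1.1, 110, size=90),
    "Heavy Rain": np.array([150.0]),  # hypothetical single observation
}

kept = drop_small_groups(weather_demo, min_n=2)
lev_stat, lev_p = levene(*kept.values())
h_stat, h_p = kruskal(*kept.values())

print(f"Groups tested: {list(kept)}")
print(f"Levene's Test -> Test Statistics : {lev_stat}, P_value : {lev_p}")
print(f"Kruskal-Wallis Test -> H_statistic: {h_stat}, P_value: {h_p}")
```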
Null Hypothesis (H0): There is no significant difference in the average number of cycles rented across the different weather conditions.
Alternate Hypothesis (Ha): At least one weather condition has a significantly different average number of cycles rented compared to the others.
print("*"*50)
print("ANOVA Test for different Weather Conditions")
print("*"*50)
stat,p_value = f_oneway(clear, mist, rain, heavy_rain)
print(f"F_statistic: {stat}")
print(f"P_value: {p_value}")
# Assume Significance Level as 5%
alpha = 0.05
if p_value < alpha:
    print("Reject Null Hypothesis")
    print("There is a Significantly different average number of cycles rented compared to the other Weather Conditions.")
else:
    print("Failed to Reject Null Hypothesis")
    print("There is No significant difference in the number of cycles rented across the different Weather Conditions.")
**************************************************
ANOVA Test for different Weather Conditions
**************************************************
F_statistic: 65.53024112793271
P_value: 5.482069475935669e-42
Reject Null Hypothesis
There is a Significantly different average number of cycles rented compared to the other Weather Conditions.
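Insights :
ANOVA Test for Different Weather Conditions: the p-value (5.48e-42) is far below 0.05, so we reject the null hypothesis. At least one weather condition has a significantly different average number of cycles rented.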
print("*"*50)
print("Kruskal-Wallis Test for different Weather Conditions")
print("*"*50)
stat,p_value = kruskal(clear, mist, rain, heavy_rain)
print(f"H_statistic: {stat}")
print(f"P_value: {p_value}")
# Assume Significance Level as 5%
alpha = 0.05
if p_value < alpha:
    print("Reject Null Hypothesis")
    print("There is a Significantly different average number of cycles rented compared to the other Weather Conditions.")
else:
    print("Failed to Reject Null Hypothesis")
    print("There is No significant difference in the number of cycles rented across the different Weather Conditions.")
**************************************************
Kruskal-Wallis Test for different Weather Conditions
**************************************************
H_statistic: 205.00216514479087
P_value: 3.501611300708679e-44
Reject Null Hypothesis
There is a Significantly different average number of cycles rented compared to the other Weather Conditions.
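Insights :
Kruskal-Wallis Test for Different Weather Conditions: the p-value (3.50e-44) is far below 0.05, so we reject the null hypothesis, in agreement with ANOVA.
To find which weather conditions differ, we compare them pairwise. With four conditions there are 6 possible pairs to compare: Clear vs Mist, Clear vs Rain, Clear vs Heavy Rain, Mist vs Rain, Mist vs Heavy Rain, and Rain vs Heavy Rain.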
print("*"*50)
print("Two-Sample Independent t-test for Clear vs Mist")
print("*"*50)
t_stat, p_value = ttest_ind(clear,mist)
print(f"t_statistics: {t_stat}")
print(f"P_value: {p_value}")
# Assume Significance Level as 5%
alpha = 0.05
if p_value < alpha:
    print("Reject Null Hypothesis")
    print("There is a Significantly different average number of cycles rented in Clear vs Mist.")
else:
    print("Failed to Reject Null Hypothesis")
    print("There is No significant difference in the number of cycles rented across Clear vs Mist.")
**************************************************
Two-Sample Independent t-test for Clear vs Mist
**************************************************
t_statistics: 6.488169251217751
P_value: 9.098916216508542e-11
Reject Null Hypothesis
There is a Significantly different average number of cycles rented in Clear vs Mist.
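Insights :
Two-Sample Independent t-test for Clear vs Mist: the p-value (9.10e-11) is far below 0.05, so we reject the null hypothesis. The positive t-statistic (6.49) indicates that average rentals are higher in clear weather than in mist.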
print("*"*50)
print("Two-Sample Independent t-test for Clear vs Rain")
print("*"*50)
t_stat, p_value = ttest_ind(clear,rain)
print(f"t_statistics: {t_stat}")
print(f"P_value: {p_value}")
# Assume Significance Level as 5%
alpha = 0.05
if p_value < alpha:
    print("Reject Null Hypothesis")
    print("There is a Significantly different average number of cycles rented in Clear vs Rain.")
else:
    print("Failed to Reject Null Hypothesis")
    print("There is No significant difference in the number of cycles rented across Clear vs Rain.")
**************************************************
Two-Sample Independent t-test for Clear vs Rain
**************************************************
t_statistics: 13.05352692528198
P_value: 1.4918709771846276e-38
Reject Null Hypothesis
There is a Significantly different average number of cycles rented in Clear vs Rain.
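Insights :
Two-Sample Independent t-test for Clear vs Rain: the p-value (1.49e-38) is far below 0.05, so we reject the null hypothesis. The positive t-statistic (13.05) indicates that average rentals are higher in clear weather than in rain.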
print("*"*50)
print("Two-Sample Independent t-test for Clear vs Heavy Rain")
print("*"*50)
#t_stat, p_value = ttest_ind(clear,heavy_rain)
#print(f"t_statistics: {t_stat}")
#print(f"P_value: {p_value}")
#
# Assume Significance Level as 5%
#alpha = 0.05
#if p_value < alpha:
# print("Reject Null Hypothesis")
# print("There is a Significantly different average number of cycles rented in Clear vs Heavy Rain.")
#else:
# print("Failed to Reject Null Hypothesis")
# print("There is No significant difference in the number of cycles rented across Clear vs Heavy Rain.")
print("Due to only one booking in Heavy_rain we cannot test with this data")
**************************************************
Two-Sample Independent t-test for Clear vs Heavy Rain
**************************************************
Due to only one booking in Heavy_rain we cannot test with this data
Insights :
Two-Sample Independent t-tests for Weather Conditions with Heavy Rain: Heavy Rain has only one observation, so Clear vs Heavy Rain cannot be tested.
print("*"*50)
print("Two-Sample Independent t-test for Mist vs Rain")
print("*"*50)
t_stat, p_value = ttest_ind(mist,rain)
print(f"t_statistics: {t_stat}")
print(f"P_value: {p_value}")
# Assume Significance Level as 5%
alpha = 0.05
if p_value < alpha:
    print("Reject Null Hypothesis")
    print("There is a Significantly different average number of cycles rented in Mist vs Rain.")
else:
    print("Failed to Reject Null Hypothesis")
    print("There is No significant difference in the number of cycles rented across Mist vs Rain.")
**************************************************
Two-Sample Independent t-test for Mist vs Rain
**************************************************
t_statistics: 9.53048112515673
P_value: 2.7459673190273642e-21
Reject Null Hypothesis
There is a Significantly different average number of cycles rented in Mist vs Rain.
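Insights :
Two-Sample Independent t-test for Mist vs Rain: the p-value (2.75e-21) is far below 0.05, so we reject the null hypothesis. The positive t-statistic (9.53) indicates that average rentals are higher in mist than in rain.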
#### **5. Mist VS Heavy Rain**
print("*"*50)
print("Two-Sample Independent t-test for Mist vs Heavy Rain")
print("*"*50)
#t_stat, p_value = ttest_ind(mist,heavy_rain)
#print(f"t_statistics: {t_stat}")
#print(f"P_value: {p_value}")
#
# Assume Significance Level as 5%
#alpha = 0.05
#if p_value < alpha:
#    print("Reject Null Hypothesis")
#    print("There is a Significantly different average number of cycles rented in Mist vs Heavy Rain.")
#else:
#    print("Failed to Reject Null Hypothesis")
#    print("There is No significant difference in the number of cycles rented across Mist vs Heavy Rain.")
print("Due to only one booking in Heavy_rain we cannot test with this data")
**************************************************
Two-Sample Independent t-test for Mist vs Heavy Rain
**************************************************
Due to only one booking in Heavy_rain we cannot test with this data
Insights :
Two-Sample Independent t-tests for Weather Conditions with Heavy Rain: Mist vs Heavy Rain cannot be tested because Heavy Rain has only one observation.
#### **6. Rain VS Heavy Rain**
print("*"*50)
print("Two-Sample Independent t-test for Rain vs Heavy Rain")
print("*"*50)
#t_stat, p_value = ttest_ind(rain,heavy_rain)
#print(f"t_statistics: {t_stat}")
#print(f"P_value: {p_value}")
#
# Assume Significance Level as 5%
#alpha = 0.05
#if p_value < alpha:
#    print("Reject Null Hypothesis")
#    print("There is a Significantly different average number of cycles rented in Rain vs Heavy Rain.")
#else:
#    print("Failed to Reject Null Hypothesis")
#    print("There is No significant difference in the number of cycles rented across Rain vs Heavy Rain.")
print("Due to only one booking in Heavy_rain we cannot test with this data")
**************************************************
Two-Sample Independent t-test for Rain vs Heavy Rain
**************************************************
Due to only one booking in Heavy_rain we cannot test with this data
Insights :
Two-Sample Independent t-tests for Weather Conditions with Heavy Rain: Rain vs Heavy Rain cannot be tested because Heavy Rain has only one observation.
Null Hypothesis(H0): Weather is independent of the season. There is no association between the two variables.
Alternate Hypothesis(Ha): Weather is dependent on the season. There is an association between the two variables.
conti_table = pd.crosstab(df['weather'], df['season'])
conti_table
| weather (rows) / season (columns) | Fall | Spring | Summer | Winter |
|---|---|---|---|---|
| 1 | 1930 | 1759 | 1801 | 1702 |
| 2 | 604 | 715 | 708 | 807 |
| 3 | 199 | 211 | 224 | 225 |
| 4 | 0 | 1 | 0 | 0 |
print("*"*50)
print("Chi-Square Test")
print("*"*50)
stat, p_value, dof, expected = chi2_contingency(conti_table)
print(f"Chi-Square Statistic: {stat}")
print(f"p-value: {p_value}")
print(f"Degrees of Freedom: {dof}")
if p_value < alpha:
    print("Reject the Null Hypothesis: Weather is dependent on the season.")
else:
    print("Fail to Reject the Null Hypothesis: Weather is independent of the season.")
**************************************************
Chi-Square Test
**************************************************
Chi-Square Statistic: 49.15865559689363
p-value: 1.5499250736864862e-07
Degrees of Freedom: 9
Reject the Null Hypothesis: Weather is dependent on the season.
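Insights :
Chi-Square Test for Weather Dependency on Season: the p-value (1.55e-07) is below 0.05, so we reject the null hypothesis. Weather and season are associated.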
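The chi-square approximation is usually considered reliable only when expected cell counts are not too small (a common rule of thumb is at least 5), and weather category 4 has a single observation. The sketch below rebuilds the contingency table from the counts shown above, inspects the expected counts, and re-runs the test without category 4. The names `conti_demo` and `reduced` are illustrative, and dropping the sparse row is one possible remedy rather than the step taken in this analysis.

```python
import pandas as pd
from scipy.stats import chi2_contingency

# Counts copied from the weather-by-season contingency table above.
conti_demo = pd.DataFrame(
    {
        "Fall": [1930, 604, 199, 0],
        "Spring": [1759, 715, 211, 1],
        "Summer": [1801, 708, 224, 0],
        "Winter": [1702, 807, 225, 0],
    },
    index=pd.Index([1, 2, 3, 4], name="weather"),
)

stat, p_value, dof, expected = chi2_contingency(conti_demo)
print(f"Smallest expected count (all rows): {expected.min():.3f}")

# Drop weather category 4 (a single observation) and repeat the test.
reduced = conti_demo.drop(index=4)
stat_r, p_value_r, dof_r, expected_r = chi2_contingency(reduced)
print(f"Smallest expected count (weather 1-3): {expected_r.min():.3f}")
print(f"Chi-Square Statistic: {stat_r}")
print(f"p-value: {p_value_r}")
print(f"Degrees of Freedom: {dof_r}")
```

If the reduced table leads to the same decision as the full table, the conclusion does not depend on the sparse row.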